
135 research outputs found

Measures for corpus similarity and homogeneity

Author: Kilgarriff Adam
Russell-Rose Tony
Publication date: 01/01/1998

How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by 'corpus similarity': human similarity judgements are not fine-grained enough, corpus similarity is inherently multidimensional, and similarity can only be interpreted in the light of corpus homogeneity. We then present an operational definition of corpus similarity which addresses or circumvents the problems, using purpose-built sets of "known-similarity corpora". These KSC sets can be used to evaluate the measures. We evaluate the measures described in the literature, including three variants of the information-theoretic measure 'perplexity'. A χ²-based measure, using word frequencies, is shown to be the best of those tested.

The Problem

How similar are two corpora? The question arises on many occasions. In NLP, many useful results can be generated from corpora, but when can the results developed using one corpus be applied to another? How much will it cost to port an NLP application from one domain, with one corpus, to another, with another? For linguistics, does it matter whether language researchers use this corpus or that, or are they similar enough for it to make no difference? There are also questions of more general interest. Looking at British national newspapers: is the Independent more like the Guardian or the Telegraph? What are the constraints on a measure for corpus similarity? The first is simply that its findings correspond to unequivocal human judgements. It mus
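The abstract above names a χ²-based measure over word frequencies but does not give its details. The following is a minimal sketch of one way such a measure could be computed, not the authors' implementation. The function name `chi_square_distance`, the `top_n` cutoff, and the choice to compare the most frequent words of the combined corpus are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_distance(tokens_a, tokens_b, top_n=500):
    """Sum of chi-square terms over the top_n most frequent words
    of the two corpora combined (hypothetical helper; lower = more similar)."""
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")

    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    total = size_a + size_b
    joint = freq_a + freq_b

    score = 0.0
    for word, joint_count in joint.most_common(top_n):
        # Expected counts if the word were spread in proportion to corpus size.
        expected_a = joint_count * size_a / total
        expected_b = joint_count * size_b / total
        score += (freq_a[word] - expected_a) ** 2 / expected_a
        score += (freq_b[word] - expected_b) ** 2 / expected_b
    return score


if __name__ == "__main__":
    corpus_a = "the cat sat on the mat".split()
    corpus_b = "the dog sat on the log".split()
    print(chi_square_distance(corpus_a, corpus_a))  # identical corpora
    print(chi_square_distance(corpus_a, corpus_b))  # differing corpora
```

Under these assumptions, a score of zero means the compared words are distributed identically across the two corpora, and larger scores indicate greater divergence. Tokenization is left to the caller, since the abstract does not specify one.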

CiteSeerX

Goldsmiths Research Online

Effective Corpus Virtualization

Author: Jakubíček Miloš
Kilgarriff Adam
Rychlý Pavel
Publication venue: EUROPEAN LANGUAGE RESOURCES ASSOCIATION-ELRA
Publication date: 01/01/2014

In this paper we describe an implementation of corpus virtualization within the Manatee corpus management system. By corpus virtualization we mean the logical manipulation of corpora or their parts, grouping them into new (virtual) corpora. We discuss the motivation for such a setup in detail and show the space and time efficiency of this approach, evaluated on an 11-billion-word corpus of Spanish.

Univerzitní repozitář Masarykovy univerzity

Gap-fill Tests for Language Learners: Corpus-Driven Item Generation

Author: Avinesh P.V.S.
Kilgarriff Adam
Smith Simon
Publication date: 01/01/2010

Coventry University Pure Portal

The Sketch Engine as infrastructure for historical corpora

Author: Husák Miloš
Kilgarriff Adam
Woodrow Robyn
Publication date: 01/01/2012

Part of the case for corpus building is always that the corpus will have many users and uses. For that, it must be easy to use. A tool and web service that makes it easy is the Sketch Engine. It is commercial, but this can be advantageous: it means that the costs and maintenance of the service are taken care of. All parties stand to gain: the resource developers both have their resource showcased at no cost, and get to use the resource within the Sketch Engine themselves (often also at no cost). Other users benefit from the functions and features of the Sketch Engine. The tool already plays this role in relation to four historical corpora, three of which are briefly presented.

Univerzitní repozitář Masarykovy univerzity

Setting up for corpus lexicography

Author: Jakubíček Miloš
Kilgarriff Adam
Pomikálek Jan
Whitelock Pete
Publication venue: Department of Linguistics and Scandinavian Studies, University of Oslo
Publication date: 01/01/2012

There are many benefits to using corpora. In order to reap those rewards, how should someone who is setting up a dictionary project proceed? We describe a practical experience of such ‘setting up’ for a new Portuguese-English, English-Portuguese dictionary being written at Oxford University Press. We focus on the Portuguese side, as OUP did not have Portuguese resources prior to the project. We collected a very large (3.5 billion word) corpus from the web, which included removing all unwanted material and duplicates. We then identified the best tools for Portuguese for lemmatizing and parsing, and undertook the very large task of parsing the corpus. We then used the dependency parses, as output by the parser, to create word sketches (one-page summaries of a word’s grammatical and collocational behavior). We plan to adapt to Portuguese an existing system for automatically identifying good candidate dictionary examples, and to add salient information about regional words to the word sketches. All of the data and associated support tools for lexicography are available to the lexicographer in the Sketch Engine corpus query system.

Univerzitní repozitář Masarykovy univerzity

Fast syntactic searching in very large corpora for many languages

Author: Jakubíček Miloš
Kilgarriff Adam
McCarthy Diana
Rychlý Pavel
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Waseda University Repository

Shared-task evaluations in HLT

Author: Belz Anya
Kilgarriff Adam
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/07/2006

DCU Online Research Access Service